Sys.setenv(LANG = "en")
require(mltools)
## Loading required package: mltools
require(data.table)
## Loading required package: data.table
require(rpart)
## Loading required package: rpart
require(party)
## Loading required package: party
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
require(corrplot)
## Loading required package: corrplot
## corrplot 0.88 loaded
require(caret)
## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2
require(ipred)
## Loading required package: ipred
require(class)
## Loading required package: class
require(ggbiplot)
## Loading required package: ggbiplot
## Loading required package: plyr
##
## Attaching package: 'plyr'
## The following object is masked from 'package:modeltools':
##
## empty
## Loading required package: scales
require(randomForest)
## Loading required package: randomForest
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
require(ROCR)
## Loading required package: ROCR
require(pROC)
## Loading required package: pROC
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(Rtsne)
library(ggplot2)
require(factoextra)
## Loading required package: factoextra
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
require(caret)
data <- read.csv(file = "C:\\Users\\von dawdop\\Desktop\\machine_learning\\bank-additional\\bank-additional-full.csv", sep = ';', na.strings="unknown")
#of course we have to delete duplicate rows before doing anything else
data <- unique( data )
data
As we can see, the dataset is pretty big: we get around 42,000 rows and 20 explanatory variables, and:
table(data$y)
##
## no yes
## 36537 4639
so this dataset is unbalanced (we have to account for it during data preprocessing and when building/measuring models). So, our task is first (after some data preprocessing) to delete some variables (dimensionality reduction, since we will have a lot of dummies), then to cluster some of the dummy variables (for further dimensionality reduction), and then to classify y using some machine learning models (mainly logistic regression, K nearest neighbors and random forest).
The duration variable tells how much time the last call took, so if duration=0 y will always be “no”. This makes the variable rather useless, since we don’t know in advance how much time a call will take (it also highly affects the outcome):
data$duration <- NULL
The feature explanations will be done in 2), together with cleaning the dataset. 2) First of all, let’s code some useful functions (all of them are based on the code presented in the classes):
#This function plots the fail/success ratio (y.values will always be the y column from our dataset) and fits a spline for the numeric variable x.values
ratio_plot = function (x.values, y.values, x.name, dataset) {
N=nrow(dataset)
data <- as.data.frame.matrix(table(x.values, y.values))
data$ratio <- data[,1]/data[,2]
#removing infinite and nan values
data <- data[is.finite(rowSums(data)),]
data$percent_of_the_data= (data[,1]+data[,2])*100/N
# we will use different colors: if a particular value covers less than 0.5% of the dataset its dot is red, if less than 1% it is blue, the rest are gray
plot(rownames(data), data$ratio,
xlab = x.name, ylab = "fail/success ratio", main = NULL, col = ifelse(data$percent_of_the_data<0.5, 'red', ifelse(data$percent_of_the_data<1, 'blue', 'gray')))
lines(smooth.spline(rownames(data), data$ratio), lwd = 2)
}
#This function makes a table with the fail/success ratio for all levels of x.values and presents them in decreasing order (as % of the dataset).
ratio_table_old= function (x.values, y.values, dataset) {
N=nrow(dataset)
data <- as.data.frame.matrix(table(x.values, y.values, useNA="always"))
data$ratio <- data[,1]/data[,2]
data$percent_of_the_data= (data[,1]+data[,2])*100/N
data <- data[order(data$percent_of_the_data, decreasing = TRUE),]
#data[,3] <-NULL
data
}
#the previous one puts an "X" in front of the non-NA level names, so we use another version of it.
ratio_table= function (x.values, y.values, dataset) {
N=nrow(dataset)
data <- as.data.frame.matrix(table(x.values, y.values))
data$ratio <- data[,1]/data[,2]
data$percent_of_the_data= (data[,1]+data[,2])*100/N
data <- data[order(data$percent_of_the_data, decreasing = TRUE),]
#data[,3] <-NULL
data
}
#Boxplot for a numeric variable
DoBoxplot = function(values, values.name) {
boxplot(values, horizontal = TRUE, xlab = values.name)
median.leg = paste("median = ", round(median(values), digits = 4))
q25 = quantile(values, 0.25)
q25.leg = paste("25th percentile =", round(q25,digits = 4),"\n")
q75 = quantile(values, 0.75)
q75.leg = paste("75th percentile =", round(q75,digits = 4))
legend(x = "top", median.leg, bty = "n")
legend(x = "bottom", paste(q25.leg, q75.leg, sep = ""), bty = "n")
}
#this one simply runs all 3 above
run3 = function(x.values, y.values, dataset, values.name) {
ratio_plot(x.values, y.values, values.name, dataset)
DoBoxplot(x.values, values.name)
ratio_table(x.values, y.values, dataset)
}
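To make the helpers concrete, here is the same ratio logic run on a small made-up sample (the x/y vectors below are synthetic, not from the bank data):

```r
# Synthetic illustration of what ratio_table() computes (toy data only)
set.seed(1)
x <- sample(c("a", "b", "c"), 200, replace = TRUE)
y <- sample(c("no", "yes"), 200, replace = TRUE, prob = c(0.8, 0.2))

tab <- as.data.frame.matrix(table(x, y))
tab$ratio <- tab[, 1] / tab[, 2]                               # fail/success ratio per level
tab$percent_of_the_data <- (tab[, 1] + tab[, 2]) * 100 / length(x)
tab[order(tab$percent_of_the_data, decreasing = TRUE), ]
```

The percent_of_the_data column always sums to 100, since every observation falls into exactly one level.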
Now we should look where the missing data is:
colSums(is.na(data))
## age job marital education default
## 0 330 80 1730 8596
## housing loan contact month day_of_week
## 990 990 0 0 0
## campaign pdays previous poutcome emp.var.rate
## 0 0 0 0 0
## cons.price.idx cons.conf.idx euribor3m nr.employed y
## 0 0 0 0 0
table(data$y)
##
## no yes
## 36537 4639
ratio_table_old(data$job, data$y, data)
The ratio of no to yes in y in the whole dataset is around 7.87 and we should keep that in mind when removing observations (we don’t want to remove too many of the yes rows, because then we would be forced to balance the data later on). We can now remove the rows with NAs in marital (these are only 80 observations) and in job (less than 1% of the dataset, and its ratio is pretty close to the one in the whole dataset). For the job variable this may mean that the missing values are random (e.g. a human error when entering data into the computer), so we can just remove them. As said in the data description on the website, we should also remove the duration variable:
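The “ratio stays close” argument can be sanity-checked on synthetic data; the data frame below is made up and only mimics the situation of a few random NAs:

```r
# Toy check: dropping a small, random set of NA rows barely moves the no/yes ratio
set.seed(2)
toy <- data.frame(
  marital = sample(c("married", "single", NA), 1000, replace = TRUE,
                   prob = c(0.60, 0.38, 0.02)),
  y = sample(c("no", "yes"), 1000, replace = TRUE, prob = c(0.887, 0.113))
)
ratio_before <- sum(toy$y == "no") / sum(toy$y == "yes")
toy_clean <- toy[!is.na(toy$marital), ]
ratio_after <- sum(toy_clean$y == "no") / sum(toy_clean$y == "yes")
c(before = ratio_before, after = ratio_after)
```

When the NAs are independent of y, as here, the ratio barely moves; a big shift would suggest the missingness is informative.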
data <- data[!is.na(data$marital),]
data <- data[!is.na(data$job),]
table(data$loan, data$housing, useNA="always")
##
## no yes <NA>
## no 15890 17718 0
## yes 2530 3653 0
## <NA> 0 0 984
ratio_table_old(data$loan,data$y,data)
ratio_table_old(data$housing, data$y, data)
data$housing<-ifelse(data$housing=="yes",1,0)
data$loan<-ifelse(data$loan=="yes",1,0)
The housing and loan variables indicate whether the person has a housing or personal loan. As we can see, those variables have their NAs in the same rows, so these gaps may be caused by the fact that the company’s employee didn’t manage to gather this information, or by human error. I also think that both variables are important, because having a housing or personal loan means that you already needed some money and thus may have much less spare cash; a housing loan in particular means that you probably have to pay a lot each month and may have no money to spare. So we will remove the rows with NAs and keep both variables (even though the correlation with loan is so low):
data <- data[!is.na(data$loan),]
ratio_table_old(data$default, data$y, data)
ratio_table_old(data$education, data$y, data)
The default variable indicates whether someone has credit in default. This variable should be extremely important, because someone with credit to pay off will be far less likely to subscribe to a term deposit. As we can see there are only 3 “yes” values and plenty of NAs, so I believe that people with credit in default may be hiding it. For this reason we will remove those 3 yes rows and make a new variable indicating whether the default status is known. As for education, I believe it’s very important, since better education correlates with higher pay, so more cash to put into a term deposit. The unknown education values are about 4% of the dataset, the NAs do not seem to be just random noise, and this group also has a very low no-to-yes ratio, so we should try to fill those NAs (we will also remove the illiterates because there are only 18 of them):
data <- data[!grepl("illiterate", data$education),]
data <- data[!grepl("yes", data$default),]
data$default_unknown<-ifelse(is.na(data$default),1,0)
data$default <- NULL
#For the contact column there are only 2 levels, so we can make it binary
data$contact_by_tel<-ifelse(data$contact=="telephone",1,0)
data$contact <- NULL
data$y<-ifelse(data$y=="yes",1,0)
#I'm factoring all the character columns except for education so I can one-hot encode them with one command
data_for_estimation <-data
data_for_estimation$month <- as.factor(data_for_estimation$month)
data_for_estimation$job <- as.factor(data_for_estimation$job)
data_for_estimation$marital <- as.factor(data_for_estimation$marital)
data_for_estimation$poutcome <- as.factor(data_for_estimation$poutcome)
data_for_estimation$day_of_week <- as.factor(data_for_estimation$day_of_week)
data_for_estimation <- one_hot(as.data.table(data_for_estimation))
data_for_estimation$education <- as.factor(data_for_estimation$education)
educ_good <- data_for_estimation[!is.na(data$education), ]
educ_nan <- data_for_estimation[is.na(data$education), ]
spec = c(train = 0.6, test = 0.2, validate = 0.2)
set.seed(100)
g = sample(cut(seq(nrow(educ_good)), nrow(educ_good)*cumsum(c(0,spec)),labels = names(spec)))
data_new_g = split(educ_good, g)
educ_trainingset<- data_new_g[[1]]
educ_test <- data_new_g[[2]]
educ_validation <- data_new_g[[3]]
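The cut/sample/split idiom above deserves a quick check; on a dummy data frame of 1000 rows (made up) it produces exactly a 60/20/20 partition:

```r
# Sketch of the 60/20/20 split idiom on a dummy data frame
spec <- c(train = 0.6, test = 0.2, validate = 0.2)
dummy <- data.frame(id = 1:1000)
set.seed(100)
g <- sample(cut(seq(nrow(dummy)), nrow(dummy) * cumsum(c(0, spec)),
                labels = names(spec)))
parts <- split(dummy, g)
sapply(parts, nrow)   # train 600, test 200, validate 200
```

cut() assigns each row index to one of the three intervals, and sample() then shuffles the labels so the assignment is random while the counts stay exact.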
#for the decision tree I'm using code from the lab
set.seed(100)
rpart_initial = rpart(education~.-y, data=educ_trainingset,cp=0 , method='class')
#Now we have to get the optimal cp (the complexity parameter that controls when the tree stops splitting)
cp_optimal = rpart_initial$cptable[which.min(rpart_initial$cptable[,"xerror"]),"CP"]
rpart_optimal = prune(rpart_initial, cp_optimal)
educ_rpart_prediction = predict(rpart_optimal, newdata=educ_validation, type = 'class')
mt_rpart = table(educ_rpart_prediction, educ_validation$education)
diagonal_sum = 0
for (i in colnames(mt_rpart)){
diagonal_sum = diagonal_sum + mt_rpart[as.character(i),as.character(i)]
}
#Accuracy
diagonal_sum/sum(mt_rpart)
## [1] 0.5259715
mt_rpart
##
## educ_rpart_prediction basic.4y basic.6y basic.9y high.school
## basic.4y 415 109 215 96
## basic.6y 16 17 13 4
## basic.9y 217 176 519 146
## high.school 61 77 151 794
## professional.course 18 18 90 161
## university.degree 92 57 183 600
##
## educ_rpart_prediction professional.course university.degree
## basic.4y 55 63
## basic.6y 0 2
## basic.9y 82 27
## high.school 95 294
## professional.course 625 316
## university.degree 189 1650
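A small aside: the diagonal-sum loop above is equivalent to sum(diag(...)) when the confusion matrix is square with rows and columns in the same order, as a toy confusion matrix (made-up numbers) shows:

```r
# Toy 2x2 confusion matrix: the loop and sum(diag()) give the same accuracy
mt <- matrix(c(40, 5, 3, 52), nrow = 2,
             dimnames = list(pred = c("a", "b"), truth = c("a", "b")))
acc_loop <- 0
for (i in colnames(mt)) acc_loop <- acc_loop + mt[i, i]
acc_loop / sum(mt)        # 0.92
sum(diag(mt)) / sum(mt)   # 0.92 as well
```

The loop version is still handy when predicted and true classes could come in different orders, since it indexes by name.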
As we can see the accuracy is very bad, so we will just remove the rows with missing education instead:
data <- data[!is.na(data$education),]
We are done with the missing values so we can head into cleaning the data:
ratio_table(data$marital,data$y,data)
ratio_table(data$month,data$y,data)
ratio_table(data$day_of_week,data$y,data)
ratio_table(data$contact_by_tel,data$y,data)
The marital variable gives the person’s marital status. As we can see, singles tend to subscribe to the term deposit more often; it may be related to the fact that they are younger and may want to accumulate more wealth. The month variable (when the call was made) should not be too important (I don’t think that using a term deposit is seasonal), but in our case it is. The reason may be that the campaigns were conducted from 2008 to 2010, so it may vary because of the 2008 economic crisis and the recovery from it. As we can see, the october, september, march and december values all contribute a small part of the data and have very similar ratios, so we can just merge them. The day variable may in fact be important, because it may be the case that in Portugal people usually get paid on Fridays, so they are more likely to take a deposit then. Given that we have data from 3 years, it can hardly be an accident that people are more likely to subscribe on Tuesday than on Monday. I think we should also merge Tuesday and Wednesday because they have almost the same ratio. The contact_by_tel variable indicates whether the contact was made by telephone (1) or by cellphone (0). I think the reason why this variable is so important is that in 2008 a cellphone was still pretty rare, so it could be considered somewhat of an expensive item.
data$month <- ifelse(data$month %in% c("dec", "mar", "sep", "oct"), "3_7_9_12", data$month)
data$day_of_week <- ifelse(data$day_of_week %in% c("tue", "wed"), "tue_wed", data$day_of_week)
data <- unique(data)
Now we can head to the numerical variables:
run3(data$campaign, data$y,data, "campaign")
nrow(data[data$campaign <11,])
## [1] 35635
nrow(data)
## [1] 36433
Campaign is the number of contacts performed for this client during this campaign. As we should have expected, the ratio greatly increases with the number of calls, and that’s obvious: you will get annoyed if you are called several times a month, and if you have not accepted the offer yet you are unlikely to accept it during yet another call. Also keep in mind that this count includes the last call, i.e. the one we are asking whether it is worth making.
As we can see there are plenty of values to be merged, so here comes the VERY important question: do we want to just model this dataset as well as possible, or do we want a model that will predict outcomes in the future? In the first case merging variable values may be a very bad idea, because for example it will lower the performance of a random forest (though it also decreases computation time slightly). In the latter case we should remove the red dots and merge the blue ones, because the red ones are surely outliers (calling someone FORTY times cannot be an accident; calling so many times is just pointless), while the blue ones are just extreme cases of calling someone too many times. Another advantage of this approach is that we will balance the dataset a little bit.
So I will assume that we want to model the future, so we are going to merge the blue dots and remove the red ones (the relation will then also be nicely linear, which will be very helpful for the logistic regression):
data <- data[data$campaign <11,]
data$campaign <- ifelse(data$campaign > 7, 9, data$campaign)
run3(data$pdays, data$y,data, "pdays")
Pdays is the number of days that passed since the last call during the previous campaign; 999 means the client was not contacted. This one should also be important, for the same reasons as campaign: if a client was contacted and did not take the term deposit, it is unlikely that he will do so now. We should of course make it binary (also note that it is different from the previous variable, because this one is about the previous campaign).
data$last_campaign_contacted <- ifelse(data$pdays==999 ,0,1)
data$pdays <- NULL
df <- data
run3(data$previous, data$y,data, "previous")
ratio_plot(exp(-data$previous),data$y, "exp(-previous)", data)
This variable stands for the number of contacts performed before this campaign for this client (so I think it should account for ALL previous campaigns). The numbers here are way lower than in campaign, which means that the vast majority of the calls were made during the last year. In this case the ratio decreases with the number of calls; it may be because the calls were mostly made if there was a success during the last campaign (and this requires at least one previous call; 2 calls mean 2 successes in 2 different campaigns, etc.). We will also merge values higher than 2 and take the exp(-x) transformation to make it linear:
data$previous <- ifelse(data$previous>2, 3,data$previous)
data$exp_previous <- exp(-data$previous)
data$previous <- NULL
ratio_table(data$poutcome,data$y,data)
This variable stands for the outcome of the previous campaign. Of course, if someone accepted our offer some time ago he is far more likely to do it again, so this variable is very important. We should note that the nonexistent value carries the same information as not being contacted does in last_campaign_contacted, so we should only encode whether the outcome of the previous campaign for this client was successful.
data$prev_success <- ifelse(data$poutcome=="success", 1,0)
data$poutcome <- NULL
df <- data
run3(data$emp.var.rate, data$y,data, "emp.var.rate")
This variable stands for the cyclical employment variation (a quarterly indicator), so it measures how many people were hired or fired due to shifts in economic conditions. When the economy is at its peak this rate is lower, so it is easier to get a job and therefore there is more cash to be invested. Let’s also observe that this relation is piecewise linear (keep in mind that this data is taken from only 3 consecutive years), but sadly we cannot make any reasonable value merges to improve its linearity, and removing some of the data (the points that may be considered outliers) also makes no sense, because they fit the trends nicely.
run3(data$cons.price.idx, data$y,data, "cons.price.idx")
This variable is simply a measure of inflation (a higher consumer price index means higher inflation). When inflation is high people tend to spend more money, and when it’s low they tend to store it (e.g. in term deposits). In this case the relation looks like two linear ones: one for the blue and red dots and one for the gray ones; yet again we can see that combining data from three years (after the economic crisis) changes a lot. For this reason I will not make any merges (or removals) in this case, because we would lose some information about the trends in one or two of the years.
run3(data$cons.conf.idx, data$y,data, "cons.conf.idx")
table(data[-data$cons.conf.idx >35,]$y)
##
## 0 1
## 30122 3198
table(data$y)
##
## 0 1
## 31444 4191
The consumer confidence index measures the optimism of consumers, so a higher index should result in more term deposits (because there is no need to spend the money right now). Again, we can clearly see two distinct linear relations, and we cannot remove the outliers because it would imbalance our data even further (we would lose the values with the lowest fail/success ratio)…
run3(data$euribor3m, data$y,data, "euribor3m")
The 3-month Euribor is the interest rate at which banks lend money to one another with 3 months to pay it back. Lower interest rates mean that term deposits will have low interest rates too, so fewer people would want to take them. Also observe that this plot is a huge mess, because it is a daily indicator.
run3(data$nr.employed, data$y,data, "nr.employed")
This variable stands for the number of employees in the bank. I think it’s very weird that we see such a strong increasing relation, because a higher number of employees should be able to process the data better. BUT there may be one reason, and it’s again related to the 3 years of data gathering: during the first two years the campaign was on a small scale, so it was much easier to preprocess the data and the outcome was (on average) much better; in the last year the bank employed many more workers for the campaign but also made far more calls, so the clients’ data was not as well analyzed as before.
run3(data$age, data$y,data, "age")
This variable is simply the age of the client. As we should have expected there are some intervals: for age < 42 people tend to have a mortgage, have to feed their kids, etc., so the expenses keep getting higher (an older kid means more cash to spend). Around 42 the kids should be close to living on their own, so they require less and less cash, and the mortgage is also almost paid off. Age > 60 is retirement age, when people have already accumulated a lot of cash and may put it in the bank. As we can see, for age below around 64 we get a parabola, while above that the function looks somewhat linear. We can also spot that above age 70 or so there are plenty of outliers (the boxplot also suggests that we can remove them without a doubt), so let’s see what happens if we remove them:
table(data$y)
##
## 0 1
## 31444 4191
nrow(data)
## [1] 35635
table(data[data$age <70,]$y)
##
## 0 1
## 31226 4010
nrow(data)-nrow(data[data$age <70,])
## [1] 399
table(data[data$age <70,]$job)
##
## admin. blue-collar entrepreneur housemaid management
## 9135 7961 1278 905 2569
## retired self-employed services student technician
## 1145 1277 3474 661 5915
## unemployed
## 916
table(data$job)
##
## admin. blue-collar entrepreneur housemaid management
## 9143 7964 1278 928 2572
## retired self-employed services student technician
## 1506 1278 3474 661 5915
## unemployed
## 916
Clearly we will lose about 5% of the positive outcomes and about 20% of the pensioners, but I think we should do this. We can also make a variable of transformed age, where we merge all ages above 65 and transform it to get a nicely linear graph:
data <- data[data$age <70,]
data$age_transformed <- ifelse(data$age>65, 66, data$age)
data$age_transformed <- (data$age_transformed-42)^2
run3(data$age_transformed, data$y,data, "age_transformed")
I think we are done with the “features explanation” part; this ends part 2) of the project (I think we have also done some of part 3)):
df <- data
Finally we have to one-hot encode the categorical variables. Standardization will be required for PCA, calculating correlations, logistic regression etc., but those algorithms do it anyway, so there is no point in standardizing our data here (I will skip it for clarity).
#before one hot encoding we need to get the character variables as factor
data <- df
data[sapply(data, is.character)] <- lapply(data[sapply(data, is.character)], as.factor)
data <- one_hot(as.data.table(data))
#After all those merges we can have more duplicates, so we just remove them (and look how much changes after this)
table(data$y)
##
## 0 1
## 31226 4010
table(unique( data )$y)
##
## 0 1
## 31216 4004
data <- unique(data)
#I will standardize the data just to show that I know how to do it
data_standardized<- as.data.frame(scale(data))
data_standardized$y <- data$y
summary(data_standardized)
## age job_admin. job_blue-collar job_entrepreneur
## Min. :-2.3457 Min. :-0.5915 Min. :-0.5403 Min. :-0.194
## 1st Qu.:-0.7841 1st Qu.:-0.5915 1st Qu.:-0.5403 1st Qu.:-0.194
## Median :-0.1594 Median :-0.5915 Median :-0.5403 Median :-0.194
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.000
## 3rd Qu.: 0.7776 3rd Qu.: 1.6905 3rd Qu.:-0.5403 3rd Qu.:-0.194
## Max. : 3.0679 Max. : 1.6905 Max. : 1.8508 Max. : 5.153
## job_housemaid job_management job_retired job_self-employed
## Min. :-0.1624 Min. :-0.2804 Min. :-0.1831 Min. :-0.194
## 1st Qu.:-0.1624 1st Qu.:-0.2804 1st Qu.:-0.1831 1st Qu.:-0.194
## Median :-0.1624 Median :-0.2804 Median :-0.1831 Median :-0.194
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.000
## 3rd Qu.:-0.1624 3rd Qu.:-0.2804 3rd Qu.:-0.1831 3rd Qu.:-0.194
## Max. : 6.1576 Max. : 3.5665 Max. : 5.4601 Max. : 5.156
## job_services job_student job_technician job_unemployed
## Min. :-0.3307 Min. :-0.1383 Min. :-0.4492 Min. :-0.1634
## 1st Qu.:-0.3307 1st Qu.:-0.1383 1st Qu.:-0.4492 1st Qu.:-0.1634
## Median :-0.3307 Median :-0.1383 Median :-0.4492 Median :-0.1634
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.3307 3rd Qu.:-0.1383 3rd Qu.:-0.4492 3rd Qu.:-0.1634
## Max. : 3.0234 Max. : 7.2306 Max. : 2.2263 Max. : 6.1195
## marital_divorced marital_married marital_single education_basic.4y
## Min. :-0.3558 Min. :-1.2385 Min. :-0.6272 Min. :-0.3319
## 1st Qu.:-0.3558 1st Qu.:-1.2385 1st Qu.:-0.6272 1st Qu.:-0.3319
## Median :-0.3558 Median : 0.8074 Median :-0.6272 Median :-0.3319
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.3558 3rd Qu.: 0.8074 3rd Qu.: 1.5944 3rd Qu.:-0.3319
## Max. : 2.8108 Max. : 0.8074 Max. : 1.5944 Max. : 3.0128
## education_basic.6y education_basic.9y education_high.school
## Min. :-0.2505 Min. :-0.4277 Min. :-0.5682
## 1st Qu.:-0.2505 1st Qu.:-0.4277 1st Qu.:-0.5682
## Median :-0.2505 Median :-0.4277 Median :-0.5682
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2505 3rd Qu.:-0.4277 3rd Qu.:-0.5682
## Max. : 3.9915 Max. : 2.3379 Max. : 1.7600
## education_professional.course education_university.degree housing
## Min. :-0.3933 Min. :-0.6687 Min. :-1.0767
## 1st Qu.:-0.3933 1st Qu.:-0.6687 1st Qu.:-1.0767
## Median :-0.3933 Median :-0.6687 Median : 0.9288
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.3933 3rd Qu.: 1.4953 3rd Qu.: 0.9288
## Max. : 2.5423 Max. : 1.4953 Max. : 0.9288
## loan month_3_7_9_12 month_apr month_aug
## Min. :-0.4399 Min. :-0.2191 Min. :-0.2623 Min. :-0.4137
## 1st Qu.:-0.4399 1st Qu.:-0.2191 1st Qu.:-0.2623 1st Qu.:-0.4137
## Median :-0.4399 Median :-0.2191 Median :-0.2623 Median :-0.4137
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.4399 3rd Qu.:-0.2191 3rd Qu.:-0.2623 3rd Qu.:-0.4137
## Max. : 2.2733 Max. : 4.5630 Max. : 3.8117 Max. : 2.4169
## month_jul month_jun month_may month_nov
## Min. :-0.4469 Min. :-0.3847 Min. :-0.7241 Min. :-0.3412
## 1st Qu.:-0.4469 1st Qu.:-0.3847 1st Qu.:-0.7241 1st Qu.:-0.3412
## Median :-0.4469 Median :-0.3847 Median :-0.7241 Median :-0.3412
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.4469 3rd Qu.:-0.3847 3rd Qu.: 1.3811 3rd Qu.:-0.3412
## Max. : 2.2374 Max. : 2.5995 Max. : 1.3811 Max. : 2.9311
## day_of_week_fri day_of_week_mon day_of_week_thu day_of_week_tue_wed
## Min. :-0.4848 Min. :-0.5152 Min. :-0.5108 Min. :-0.8046
## 1st Qu.:-0.4848 1st Qu.:-0.5152 1st Qu.:-0.5108 1st Qu.:-0.8046
## Median :-0.4848 Median :-0.5152 Median :-0.5108 Median :-0.8046
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.4848 3rd Qu.:-0.5152 3rd Qu.:-0.5108 3rd Qu.: 1.2428
## Max. : 2.0626 Max. : 1.9409 Max. : 1.9577 Max. : 1.2428
## campaign emp.var.rate cons.price.idx cons.conf.idx
## Min. :-0.7562 Min. :-2.2165 Min. :-2.3668 Min. :-2.2147
## 1st Qu.:-0.7562 1st Qu.:-1.1931 1st Qu.:-0.8539 1st Qu.:-0.4572
## Median :-0.1853 Median : 0.6619 Median :-0.2151 Median :-0.2619
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3856 3rd Qu.: 0.8538 3rd Qu.: 0.7369 3rd Qu.: 0.9098
## Max. : 3.8112 Max. : 0.8538 Max. : 2.0750 Max. : 2.9710
## euribor3m nr.employed y default_unknown
## Min. :-1.7162 Min. :-2.8450 Min. :0.0000 Min. :-0.5061
## 1st Qu.:-1.3062 1st Qu.:-0.9476 1st Qu.:0.0000 1st Qu.:-0.5061
## Median : 0.7226 Median : 0.3393 Median :0.0000 Median :-0.5061
## Mean : 0.0000 Mean : 0.0000 Mean :0.1137 Mean : 0.0000
## 3rd Qu.: 0.7827 3rd Qu.: 0.8588 3rd Qu.:0.0000 3rd Qu.:-0.5061
## Max. : 0.8312 Max. : 0.8588 Max. :1.0000 Max. : 1.9758
## contact_by_tel last_campaign_contacted exp_previous prev_success
## Min. :-0.7635 Min. :-0.1933 Min. :-3.6003 Min. :-0.184
## 1st Qu.:-0.7635 1st Qu.:-0.1933 1st Qu.: 0.3977 1st Qu.:-0.184
## Median :-0.7635 Median :-0.1933 Median : 0.3977 Median :-0.184
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.000
## 3rd Qu.: 1.3097 3rd Qu.:-0.1933 3rd Qu.: 0.3977 3rd Qu.:-0.184
## Max. : 1.3097 Max. : 5.1724 Max. : 0.3977 Max. : 5.436
## age_transformed
## Min. :-0.9804
## 1st Qu.:-0.8205
## Median :-0.3409
## Mean : 0.0000
## 3rd Qu.: 0.4585
## Max. : 5.2648
#this function plots the graph for the choice_one, choice_two principal components; it shows only the arrows for rows from first_row to last_row
plot_pca = function (data_pca, first_row, last_row, choice_one, choice_two){
data_pca$y <-NULL
data_pca <- prcomp(data_pca, center = TRUE,scale. = TRUE)
g <- ggbiplot(data_pca, ellipse=TRUE, groups=data$y, obs.scale = 1, var.scale = 1, alpha=0, choices = choice_one:choice_two)
g <- ggplot_build(g)
g$data[[1]] <- g$data[[1]][(first_row:last_row), ]
g$data[[4]] <- g$data[[4]][(first_row:last_row), ]
plot(ggplot_gtable(g))
}
#this one makes an ordered table of the highest correlations between variables above some threshold
ordered_high_cor = function(data, threshold ) {
tab<-cor(data)
tab <- as.data.frame(as.table(tab))
tab <- subset(tab, abs(Freq) >threshold & ! tab$Var1==tab$Var2)
tab <- tab[order(abs(tab$Freq), decreasing = TRUE),]
tab[seq(1, nrow(tab), 2),]
}
#this one lists the lowest correlations with our target variable
ordered_low_y_cor= function(data, threshold ) {
tab<- cor(data, data$y)
tab <- as.data.frame(as.table(tab))
tab <- subset(tab, abs(Freq) < threshold)
tab <- tab[order(abs(tab$Freq), decreasing = FALSE),]
tab$Var2 <- NULL
tab
}
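The tab[seq(1, nrow(tab), 2), ] line in ordered_high_cor() drops the mirrored duplicate of each correlation pair (cor(a, b) appears twice in the flattened table). A toy run with made-up variables:

```r
# Toy data: a and b are nearly identical, z is independent noise
set.seed(3)
a <- rnorm(100)
b <- a + rnorm(100, sd = 0.1)
z <- rnorm(100)
tab <- as.data.frame(as.table(cor(data.frame(a, b, z))))
tab <- subset(tab, abs(Freq) > 0.5 & !(Var1 == Var2))
tab <- tab[order(abs(tab$Freq), decreasing = TRUE), ]
tab[seq(1, nrow(tab), 2), ]   # one row per pair: only the (a, b) pair survives
```

Since the correlation matrix is symmetric, after sorting by absolute value each pair occupies two consecutive rows, so keeping every other row keeps exactly one copy per pair.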
Looking at the job variables first:
#here only job_admin., job_blue-collar, job_entrepreneur, job_management and job_housemaid are considered
plot_pca (data, 2, 6, 1,2)
plot_pca (data, 2, 6, 2,3)
plot_pca (data, 2, 6, 3,4)
#here is the rest of the jobs
plot_pca (data, 7, 12, 1,2)
plot_pca (data, 7, 12, 2,3)
plot_pca (data, 7, 12, 3,4)
ordered_low_y_cor(data, 0.1)
First let’s observe that in our case PC1 is extremely important; the other components explain around 2 percent or less of the variance in y. As we can see, job_management, job_self-employed and job_housemaid do not contribute to any of PCs 1 to 4 and have the lowest correlations, so we will remove them. Also, job_technician has a very low correlation, but it is rather important for the first principal component, so we will not remove it:
data$job_management <- NULL
data[,7] <- NULL
data$job_housemaid <- NULL
Now lets consider how the job and education variables correlates on the PCA graphs:
plot_pca (data, 2, 18, 1,2)
#now running those just to see where the names are
plot_pca (data, 2, 11, 1,2)
plot_pca (data, 12, 18, 1,2)
#some other, cleaner pca graphs
plot_pca (data, 2, 18, 2,3)
plot_pca (data, 12, 18, 3,4)
ordered_low_y_cor(data, 0.1)
As we can see, PC1 is getting more and more important… We can now merge the university education with the admin profession and all of the basic educations with the blue-collar job. I will also remove education_professional.course because it has a very low correlation with y and contributes only a little to PCs 1 to 3:
data$education_professional.course <- NULL
data$univ_or_admin <- ifelse(data$job_admin.==1 | data$education_university.degree ==1, 1,0)
data$job_admin. <- NULL
data$education_university.degree <- NULL
data$basic_or_bluecoll <- ifelse(data$education_basic.4y ==1 | data$education_basic.6y ==1 |data$education_basic.9y ==1 | data[,2]==1, 1,0)
data$education_basic.4y <- NULL
data$education_basic.6y <- NULL
data$education_basic.9y <- NULL
data[,2] <- NULL
Going for the day and month variables:
#split into two parts so it's cleaner:
plot_pca (data, 14, 19, 1,2)
plot_pca (data, 14, 19, 2,3)
plot_pca (data, 14, 19, 3,4)
plot_pca (data, 20, 24, 1,2)
plot_pca (data, 20, 24, 2,3)
plot_pca (data, 20, 24, 3,4)
plot_pca (data, 20, 24, 4,5)
ordered_low_y_cor(data, 0.1)
I think we should leave all of the month variables and remove the day_of_week_fri and day_of_week_tue_wed variables (both have very low correlations and are not too important in terms of PCA analysis).
data$day_of_week_tue_wed <- NULL
data$day_of_week_fri <- NULL
Going for the housing and loan variables:
plot_pca (data, 12, 13, 1,2)
plot_pca (data, 12, 13, 2,3)
plot_pca (data, 12, 13, 3,4)
plot_pca (data, 12, 13, 4,5)
ordered_low_y_cor(data,0.1)
We can surely remove them:
data$housing <- NULL
data$loan <- NULL
We can look at the rest of the dummy variables:
plot_pca (data, 27, 31, 1,2)
plot_pca (data, 27, 31, 2,3)
plot_pca (data, 27, 31, 3,4)
plot_pca (data, 27, 31, 4,5)
ordered_high_cor(data,0.1)
cor(data$y,data$last_campaign_contacted)
## [1] 0.3123198
cor(data$y,data$prev_success)
## [1] 0.3040679
prev_success and last_campaign_contacted are extremely correlated and look the same on the PCA plot, so we can remove the one with the lower correlation with our target variable:
data$prev_success <- NULL
We have already deleted 13 variables (almost 30% of the total), so we can now try some more computationally demanding techniques. We will now use k-means clustering, but sadly we can do so only on the continuous variables (we could also use Gower's metric, but it is not a built-in option).
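For reference, Gower's metric is available through the `cluster` package (an assumption: that package is installed, though it ships with most R distributions); a minimal sketch on a toy mixed-type frame:

```r
library(cluster)  # daisy() implements Gower's dissimilarity

# toy frame mixing a continuous column with a binary factor
toy <- data.frame(age  = c(25, 40, 60),
                  loan = factor(c(1, 0, 1)))
d <- as.matrix(daisy(toy, metric = "gower"))
d  # symmetric matrix of pairwise dissimilarities, all in [0, 1]
```

A matrix like this could be fed directly into `hclust()` or `cluster::pam()`, which is how mixed-type rows would be clustered without dropping the dummy variables.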
#first we must get only the continuous variables
data_clustering <- data
data_clustering <- data_clustering[,-c(2:20)]
data_clustering <- data_clustering[,-c(8:11)]
data_clustering <- data_clustering[,-c(10:11)]
data_clustering <- unique(data_clustering)
data_clustering<- as.data.frame(scale(data_clustering))
#now in order to cluster variables we need to transpose the dataset
data_clustering_t <- t(data_clustering)
#we restart the algorithm 25 times (so the outcome is more stable) and use 4 clusters; I won't optimize the number of clusters here
set.seed(100)
km <- kmeans(data_clustering_t, centers=4, nstart = 25)
fviz_cluster(km, data_clustering_t)
The plot above says that nr.employed, emp.var.rate and euribor3m behave very similarly; we can also see this on the correlation heatmap:
corrplot(cor(data_clustering), method = "pie", tl.cex=0.5)
Clearly we should remove two of these three. I think we should remove euribor3m and emp.var.rate, because they are not linear and euribor3m also contains a lot of noise. As we can see, by doing so we will not lose too many rows after keeping only the unique ones:
nrow(unique(data))
## [1] 31643
table(unique(data)$y)
##
## 0 1
## 27721 3922
data$euribor3m <- NULL
data$emp.var.rate <- NULL
nrow(unique(data))
## [1] 28097
table(unique(data)$y)
##
## 0 1
## 24225 3872
We can also use hierarchical clustering:
#of course we don't want our target variable in the dendrogram
hc <- hclust(dist(t(data[,-25])))
plot(hc)
The tree above says that all the variables except nr.employed may be considered one huge cluster, so I will not merge anything using this graph.
corrplot(cor(data), method = "pie", tl.cex=0.5)
ordered_low_y_cor(data,0.1)
ordered_high_cor(data,0.1)
Looking at the heatmap once again, I think that we should drop both marital_divorced and cons.price.idx, the latter being one of the most correlated variables. Dropping marital_divorced loses no information, because we still have the marital_married and marital_single variables.
data$marital_divorced <- NULL
data$cons.price.idx <- NULL
We should also observe that we have to remove at least one of the month variables, otherwise we fall into something called the "dummy trap": if, for example, all of the month variables except month_jul are 0, we know the month has to be July, so there is a perfect linear dependence among those variables. As we have seen before, none of the merges seemed reasonable, so we will just remove the one with the lowest correlation with our target variable:
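The trap itself can be demonstrated on a toy frame with made-up data: given an exhaustive set of dummies, `lm`/`glm` detect the perfect linear dependence and drop one coefficient.

```r
# three exhaustive dummies always sum to 1, duplicating the intercept
set.seed(1)
g  <- sample(1:3, 50, replace = TRUE)
df <- data.frame(a = as.integer(g == 1),
                 b = as.integer(g == 2),
                 c = as.integer(g == 3),
                 y = rnorm(50))
coef(lm(y ~ a + b + c, data = df))  # one coefficient comes back NA
```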
ordered_low_y_cor(data,0.1)
data$month_jun <- NULL
table(data$y)
##
## 0 1
## 31216 4004
data <- unique(data)
table(data$y)
##
## 0 1
## 24225 3872
data
I think that at this point our data may be considered a lot cleaner. We have reduced the number of variables from 46 to 28 (counting all of the one-hot encoded ones) and, by removing the repeated rows, managed to balance the data a little. We will probably get a worse score on correlation-robust algorithms such as the random forest, but we have shortened the computation time and can now afford better algorithm optimization techniques.
#there are fancier ways of oversampling, but unfortunately the SMOTE function seems to be bugged (or I don't know how to use it)
oversample = function (data, seed ){
if(missing(seed)) {
seed=100
}
set.seed(seed)
rows_sampled = sample (1:nrow(data[data$y==1,]), nrow(data[data$y==0,]), replace = TRUE)
data_1 <- data[data$y==1,]
data_1 <- data_1[rows_sampled,]
return(merge(data[data$y==0,], data_1, all=TRUE))
}
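A quick self-contained sanity check of the oversampling idea on a toy frame (same logic as `oversample()` above: sample the minority rows with replacement until the class counts match):

```r
set.seed(100)
toy      <- data.frame(x = rnorm(10), y = c(rep(0, 8), rep(1, 2)))
minority <- toy[toy$y == 1, ]
# draw as many minority rows (with replacement) as there are majority rows
boosted  <- minority[sample(nrow(minority), sum(toy$y == 0), replace = TRUE), ]
balanced <- rbind(toy[toy$y == 0, ], boosted)
table(balanced$y)  # both classes now have 8 rows
```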
#splitting the dataset for training test and validation
data_split = function(data_to_split, ratio_train, ratio_test, ratio_validate, seed){
if(missing(seed)) {
seed=100
}
spec = c(train = ratio_train, test = ratio_test, validate = ratio_validate)
set.seed(seed)
g = sample(cut(seq(nrow(data_to_split)), nrow(data_to_split)*cumsum(c(0,spec)),labels = names(spec)))
data_g = split(data_to_split, g)
set.seed(seed)
#we balance the training set only; balancing the test or validation set would give meaningless results (because then we would evaluate on the oversampled dataset)
data_g[[1]] <- oversample(data_g[[1]], seed)
return(data_g)
}
For the measurement metric we will use the F-score. The reason is that our data is unbalanced, so a measure like accuracy is pointless; we need a measure that checks how well the positives were predicted while at the same time penalizing calls to the clients who will not subscribe to the term deposit. We could also use the AUC score, but computing it is far more time-consuming and not too helpful if we want to, for example, optimize the logistic regression cut-off threshold. For the first model we will simply use logistic regression, because it is very fast, we have far more data than variables so it should not overfit, and it is not a black box, so we can easily inspect the outcome if we want to. I will not use the step function (a variable-reduction technique) because it would lower our F-score. The bad thing about logistic regression is that there are no hyperparameters to tune…
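The F-score used below is the standard F1, written as TP / (TP + (FP + FN) / 2); a quick check on a made-up confusion matrix that this matches the usual precision/recall form:

```r
# toy confusion matrix, rows = predicted, columns = actual
tab <- matrix(c(50, 20, 10, 40), nrow = 2)  # TP = 40, FP = 20, FN = 10
f1  <- tab[2, 2] / (tab[2, 2] + (tab[1, 2] + tab[2, 1]) / 2)
precision <- tab[2, 2] / sum(tab[2, ])
recall    <- tab[2, 2] / sum(tab[, 2])
c(f1, 2 * precision * recall / (precision + recall))  # both 0.7272727
```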
model_regression = function (data_splitted, seed) {
if(missing(seed)) {
seed=100
}
training <- data_splitted[[1]]
test <- data_splitted[[2]]
validation <- data_splitted[[3]]
set.seed(seed)
model = glm(y ~ ., data = training, family = binomial)
#to find the optimal cut-off threshold we look for the highest F score on the test set
CUT_OFFS = seq(0, 1, by = 0.01)
Fscore = function(x, outcome, test){
outcome_y <- ifelse(outcome > x, 1, 0)
tab <- table(outcome_y, test$y)
#using ifelse because if the cut-off is too high or too low everything gets classified as 1 or 0 (which means the cut-off is terrible)
ifelse (nrow(tab)==2, tab[2,2]/(tab[2,2]+(tab[1,2]+tab[2,1])/2),0 )
}
set.seed(seed)
outcome <- predict(model, newdata = test, type = "response")
Fscores <- sapply(CUT_OFFS, function(x) Fscore (x, outcome, test))
optimal_cut <- (which.max(Fscores)[1]-1)*0.01
outcome_test <- predict(model, newdata=test, type="response")
outcome_validation <- predict(model, newdata = validation, type = "response")
outcome_y_validation <- ifelse(outcome_validation > optimal_cut, 1, 0)
outcome_y_test <- ifelse(outcome_test > optimal_cut, 1, 0)
conf_matrix <- table(outcome_y_validation, validation$y)
Fscore <- function(x){x[2,2]/(x[2,2]+(x[1,2]+x[2,1])/2)}
print(conf_matrix)
print(Fscore(conf_matrix))
#now using the code from the labs we produce some curves
score.or.class = gain = lift = roc = auc = prediction.object = list()
score.or.class[[1]] = list(validation$y, validation$y)
score.or.class[[2]] = list(outcome_test,test$y)
score.or.class[[3]] = list(outcome_validation, validation$y)
class.average = mean(validation$y)
random.class = 1
for (i in 1:(nrow(validation) - 1)) {
random.class = c(random.class, mean(random.class) < class.average)
}
score.or.class[[4]] = list(seq(0, 1, len = nrow(validation)), random.class)
for (i in 1:length(score.or.class)) {
prediction.object[[i]] = prediction(score.or.class[[i]][[1]],
score.or.class[[i]][[2]])
gain[[i]] = performance(prediction.object[[i]], "tpr", "rpp")
lift[[i]] = performance(prediction.object[[i]], "lift", "rpp")
roc[[i]] = performance(prediction.object[[i]], "tpr", "fpr")
auc[[i]] = performance(prediction.object[[i]], "auc")
}
LEGEND_LABELS = c("wizard", "test", "validation", "random")
#Creating plots
ShowCurve = function(list, name, AUC = FALSE, legend.position = "right") {
for (i in 1:length(list)) {
plot(list[[i]], main = paste(name, " curve"),
col = gray((i - 1) / 4), lwd = 2, add = (i != 1), xlim = c(0, 1))
if (AUC) {
text(.2, 0.9 - i * 0.1, pos = 4, col = gray((i - 1) / 4), cex = .9,
paste("AUC =", round(auc[[i]]@y.values[[1]], digit = 2)))
}
}
legend(legend.position, lty = 2, lwd = 2, col = gray(0:3 / 4),
y.intersp = .6, legend = LEGEND_LABELS, seg.len = 0.6, bty = "n")
}
par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))
ShowCurve(gain, "Gain")
ShowCurve(lift, "Lift", legend.position = "topright")
ShowCurve(roc, "ROC", AUC = TRUE)
return(model)
}
#using a simple 60:20:20 split
start.time <- Sys.time()
model_regression_outcome <- model_regression (data_split(data,0.6,0.2,0.2))
##
## outcome_y_validation 0 1
## 0 4308 327
## 1 561 424
## [1] 0.4884793
end.time <- Sys.time()
end.time - start.time
## Time difference of 2.444954 secs
We can see a few things: looking at the curves, there seems to be almost no problem with overfitting (so using lasso or an elastic net is pointless), the model runs extremely fast, and most of the positive outcomes were (luckily) predicted as positive.
summary(model_regression_outcome)
##
## Call:
## glm(formula = y ~ ., family = binomial, data = training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7890 -0.9151 -0.1085 0.9380 2.1150
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 39.9283593 1.2894413 30.966 < 2e-16 ***
## age -0.0031225 0.0016335 -1.912 0.055938 .
## job_entrepreneur -0.1258485 0.0717517 -1.754 0.079440 .
## job_retired 0.0951857 0.0818564 1.163 0.244896
## job_services -0.1406757 0.0553642 -2.541 0.011056 *
## job_student 0.3012275 0.0910389 3.309 0.000937 ***
## job_technician -0.0757684 0.0444169 -1.706 0.088037 .
## job_unemployed -0.2839895 0.0810680 -3.503 0.000460 ***
## marital_married 0.1887688 0.0437097 4.319 1.57e-05 ***
## marital_single 0.1055187 0.0500578 2.108 0.035036 *
## education_high.school -0.1978347 0.0366373 -5.400 6.67e-08 ***
## month_3_7_9_12 0.2484240 0.0713184 3.483 0.000495 ***
## month_apr -0.2440274 0.0634500 -3.846 0.000120 ***
## month_aug -0.1938308 0.0681708 -2.843 0.004465 **
## month_jul -0.1378374 0.0575602 -2.395 0.016635 *
## month_may -0.7134274 0.0477377 -14.945 < 2e-16 ***
## month_nov -0.7008220 0.0626567 -11.185 < 2e-16 ***
## day_of_week_mon -0.2957453 0.0342703 -8.630 < 2e-16 ***
## day_of_week_thu -0.0682798 0.0337407 -2.024 0.043005 *
## campaign -0.0908654 0.0080732 -11.255 < 2e-16 ***
## cons.conf.idx 0.0137456 0.0037320 3.683 0.000230 ***
## nr.employed -0.0076667 0.0002572 -29.813 < 2e-16 ***
## default_unknown -0.3383842 0.0391247 -8.649 < 2e-16 ***
## contact_by_tel -0.5501668 0.0422701 -13.016 < 2e-16 ***
## last_campaign_contacted 1.7542983 0.0770649 22.764 < 2e-16 ***
## exp_previous 0.7665682 0.0655320 11.698 < 2e-16 ***
## age_transformed 0.0009528 0.0001426 6.683 2.35e-11 ***
## univ_or_admin 0.0743733 0.0419378 1.773 0.076159 .
## basic_or_bluecoll -0.1076141 0.0452887 -2.376 0.017493 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 40227 on 29017 degrees of freedom
## Residual deviance: 32559 on 28989 degrees of freedom
## AIC: 32617
##
## Number of Fisher Scoring iterations: 5
From the above it turns out that almost all of the variables used are statistically significant. The model also runs very fast, so we can try expanding the variable set by interacting the variables with campaign, cons.conf.idx and nr.employed:
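The interaction formula used below can be previewed with `model.matrix()` on a toy frame (column names here are made up), to see exactly which columns the `(.) * (…)` syntax generates:

```r
# '.' expands to all columns of the frame, each then interacted with z
toy <- data.frame(a = 1:3, b = 4:6, z = c(0, 1, 0))
colnames(model.matrix(~ (.) * (z), data = toy))
# main effects plus the a:z and b:z interaction columns (but no a:b)
```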
model_regression_expanded = function (data_splitted, seed) {
if(missing(seed)) {
seed=100
}
training <- data_splitted[[1]]
test <- data_splitted[[2]]
validation <- data_splitted[[3]]
set.seed(seed)
model = glm(y ~ + (.)*(+campaign+cons.conf.idx+nr.employed) , data = training, family = binomial)
#to find the optimal cut-off threshold we look for the highest F score on the test set
CUT_OFFS = seq(0, 1, by = 0.01)
Fscore = function(x, outcome, test){
outcome_y <- ifelse(outcome > x, 1, 0)
tab <- table(outcome_y,test$y)
#using ifelse because if the cut-off is too high or too low everything gets classified as 1 or 0 (which means the cut-off is terrible)
ifelse (nrow(tab)==2, tab[2,2]/(tab[2,2]+(tab[1,2]+tab[2,1])/2),0 )
}
set.seed(seed)
outcome <- predict(model, newdata = test, type = "response")
Fscores <- sapply(CUT_OFFS, function(x) Fscore (x, outcome, test))
optimal_cut <- (which.max(Fscores)[1]-1)*0.01
outcome_test <- predict(model, newdata=test, type="response")
outcome_validation <- predict(model, newdata = validation, type = "response")
outcome_y_validation <- ifelse(outcome_validation > optimal_cut, 1, 0)
outcome_y_test <- ifelse(outcome_test > optimal_cut, 1, 0)
conf_matrix <- table(outcome_y_validation, validation$y )
Fscore <- function(x){x[2,2]/(x[2,2]+(x[1,2]+x[2,1])/2)}
print(conf_matrix)
print(Fscore(conf_matrix))
#now using the code from the labs we produce some curves
score.or.class = gain = lift = roc = auc = prediction.object = list()
score.or.class[[1]] = list(validation$y, validation$y)
score.or.class[[2]] = list(outcome_test,test$y)
score.or.class[[3]] = list(outcome_validation, validation$y)
class.average = mean(validation$y)
random.class = 1
for (i in 1:(nrow(validation) - 1)) {
random.class = c(random.class, mean(random.class) < class.average)
}
score.or.class[[4]] = list(seq(0, 1, len = nrow(validation)), random.class)
for (i in 1:length(score.or.class)) {
prediction.object[[i]] = prediction(score.or.class[[i]][[1]],
score.or.class[[i]][[2]])
gain[[i]] = performance(prediction.object[[i]], "tpr", "rpp")
lift[[i]] = performance(prediction.object[[i]], "lift", "rpp")
roc[[i]] = performance(prediction.object[[i]], "tpr", "fpr")
auc[[i]] = performance(prediction.object[[i]], "auc")
}
LEGEND_LABELS = c("wizard", "test", "validation", "random")
#Creating plots
ShowCurve = function(list, name, AUC = FALSE, legend.position = "right") {
for (i in 1:length(list)) {
plot(list[[i]], main = paste(name, " curve"),
col = gray((i - 1) / 4), lwd = 2, add = (i != 1), xlim = c(0, 1))
if (AUC) {
text(.2, 0.9 - i * 0.1, pos = 4, col = gray((i - 1) / 4), cex = .9,
paste("AUC =", round(auc[[i]]@y.values[[1]], digit = 2)))
}
}
legend(legend.position, lty = 2, lwd = 2, col = gray(0:3 / 4),
y.intersp = .6, legend = LEGEND_LABELS, seg.len = 0.6, bty = "n")
}
par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))
ShowCurve(gain, "Gain")
ShowCurve(lift, "Lift", legend.position = "topright")
ShowCurve(roc, "ROC", AUC = TRUE)
print(optimal_cut)
return(model)
}
#using a simple 60:20:20 split
start.time <- Sys.time()
model_regression_expanded_outcome <- model_regression_expanded (data_split(data,0.6,0.2,0.2))
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
##
## outcome_y_validation 0 1
## 0 4419 346
## 1 450 405
## [1] 0.5043587
## [1] 0.65
end.time <- Sys.time()
end.time - start.time
## Time difference of 4.576219 secs
The warning “prediction from a rank-deficient fit may be misleading” is caused by the dummy trap I discussed before. As we can see, the ROC curves and the F-score are a bit better (even though there is a dummy trap); if we add more interactions we will surely run into multicollinearity problems, so we will stop here.
For the next model we will use K nearest neighbors. This model is very good when there is not too much noise in the dataset, it can capture non-linear relationships with the target variable, and as far as I know it is also widely used for recommendations (e.g. at Netflix). The downside is that this model runs very slowly and is very sensitive to noise.
model_KNN = function (data_splitted, K_lowest, K_highest, seed) {
if(missing(seed)) {
seed=100
}
training <- data_splitted[[1]]
test <- data_splitted[[2]]
validation <- data_splitted[[3]]
#for knn we need to split our datasets in 2 parts
target_training <- training$y
test_y <- test$y
validation_y <- validation$y
training$y <- NULL
test$y <- NULL
validation$y <- NULL
K_table = c(K_lowest:K_highest)
Fscores=c()
for (i in K_table){
outcome_y <- knn(training ,test,cl=target_training,k=i)
tab <- table(outcome_y ,test_y)
Fscores[i] <- ifelse (nrow(tab)==2, tab[2,2]/(tab[2,2]+(tab[1,2]+tab[2,1])/2),0 )
}
optimal_K <- which.max(Fscores)[1]
outcome_y <- knn(training ,validation, cl=target_training, k=optimal_K)
conf_matrix <- table(outcome_y ,validation_y)
Fscore <- function(x){ ifelse (nrow(x)==2, x[2,2]/(x[2,2]+(x[1,2]+x[2,1])/2), 0) }
print(optimal_K)
#in this case the relation between the K and F score is very interesting so we will plot it
Fscores <- Fscores[!is.na(Fscores)]
plot(K_table,Fscores)
print(Fscore(conf_matrix))
print(conf_matrix)
outcome_test <- knn(training, test, cl=target_training, k=optimal_K, prob=TRUE)
outcome_validation <- knn(training, validation, cl=target_training, k=optimal_K, prob=TRUE)
#for the curves we need to have the probability outcomes, otherwise the graphs would be just one point
outcome_test <- attr(outcome_test, "prob")
outcome_validation <- attr(outcome_validation, "prob")
score.or.class = gain = lift = roc = auc = prediction.object = list()
score.or.class[[1]] = list(validation_y, validation_y)
score.or.class[[2]] = list(outcome_test,test_y)
score.or.class[[3]] = list(outcome_validation, validation_y)
class.average = mean(validation_y)
random.class = 1
for (i in 1:(nrow(validation) - 1)) {
random.class = c(random.class, mean(random.class) < class.average)
}
score.or.class[[4]] = list(seq(0, 1, len = nrow(validation)), random.class)
for (i in 1:length(score.or.class)) {
prediction.object[[i]] = prediction(score.or.class[[i]][[1]],
score.or.class[[i]][[2]])
gain[[i]] = performance(prediction.object[[i]], "tpr", "rpp")
lift[[i]] = performance(prediction.object[[i]], "lift", "rpp")
roc[[i]] = performance(prediction.object[[i]], "tpr", "fpr")
auc[[i]] = performance(prediction.object[[i]], "auc")
}
LEGEND_LABELS = c("wizard", "test", "validation", "random")
#Creating plots
ShowCurve = function(list, name, AUC = FALSE, legend.position = "right") {
for (i in 1:length(list)) {
plot(list[[i]], main = paste(name, " curve"),
col = gray((i - 1) / 4), lwd = 2, add = (i != 1), xlim = c(0, 1))
if (AUC) {
text(.2, 0.9 - i * 0.1, pos = 4, col = gray((i - 1) / 4), cex = .9,
paste("AUC =", round(auc[[i]]@y.values[[1]], digit = 2)))
}
}
legend(legend.position, lty = 2, lwd = 2, col = gray(0:3 / 4),
y.intersp = .6, legend = LEGEND_LABELS, seg.len = 0.6, bty = "n")
}
par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))
ShowCurve(gain, "Gain")
ShowCurve(lift, "Lift", legend.position = "topright")
ShowCurve(roc, "ROC", AUC = TRUE)
#return(conf_matrix)
}
#we need to standardize the data first, because as far as I know the built-in algorithm won't do this
data_standardized<- as.data.frame(scale(data))
data_standardized$y <- data$y
#we will begin with running it for K from 1 to 100 but be careful, it DOES take a while
start.time <- Sys.time()
model_KNN(data_split(data_standardized,0.6,0.2,0.2),1,100)
## [1] 98
## [1] 0.4057613
## validation_y
## outcome_y 0 1
## 0 3715 260
## 1 1154 491
end.time <- Sys.time()
end.time - start.time
## Time difference of 12.31584 mins
I don't want to show all of the K values checked, because it would take more than 3 hours. What we can observe is that the F score consistently rises with K, so we can try running it with the biggest K possible:
start.time <- Sys.time()
model_KNN(data_split(data_standardized,0.6,0.2,0.2),450,499)
## [1] 481
## [1] 0.4401325
## validation_y
## outcome_y 0 1
## 0 4030 294
## 1 839 457
end.time <- Sys.time()
end.time - start.time
## Time difference of 18.6216 mins
As we can observe, the F score still rises with K (the relation looks like x + sin(x)), but in this range of K the model is still worse than the regression: the F score is still far lower than before, and the ROC curve seems to always lie below the one from the logistic regression model (a clear sign that this model is worse). The biggest advantage of this model is that it misclassifies class 1 as 0 much less often than the linear model does.
Such a low AUC with a pretty good F score may imply that the votes for most points were almost ties, so we surely have some serious noise issues. The same goes for such a large K: we need a huge number of neighbors to overcome the noise. Another thing to consider is that with lower K fewer 1s are mispredicted as 0s, which may imply that the ones lie somewhat close to each other, but the noise makes prediction difficult.
Another drawback of this model is that there are not many hyperparameters to tune: we can certainly tune the number of neighbors considered (K), but that is very slow, and as far as I know there are no other parameters to consider; one option would be weighted KNN, but it would take far too long.
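For the record, the idea behind weighted KNN can be sketched in a few lines (a toy hand-rolled version for a single query point, not the production-grade implementation found in e.g. the `kknn` package):

```r
# distance-weighted KNN for one query point:
# each of the k nearest neighbors votes with weight 1/distance
weighted_knn1 <- function(train_x, train_y, query, k = 5) {
  d  <- sqrt(rowSums(sweep(train_x, 2, query)^2))  # Euclidean distances
  nn <- order(d)[1:k]                              # indices of the k nearest
  w  <- 1 / (d[nn] + 1e-9)                         # inverse-distance weights
  votes <- tapply(w, train_y[nn], sum)             # weighted votes per class
  names(which.max(votes))
}

# the first iris row is its own nearest neighbor, so it votes heavily for itself
weighted_knn1(as.matrix(iris[, 1:4]), iris$Species,
              as.numeric(iris[1, 1:4]), k = 5)  # "setosa"
```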
For the next model we will use an ensemble method, namely the random forest. This model has plenty of advantages: it is robust to outliers (probably not too useful in our case), it can capture non-linear relationships (which logistic regression cannot, and as we have seen there are probably plenty of them), and it tends not to overfit (which would certainly be an issue for a standard decision tree). The drawbacks are that the algorithm trains very slowly, needs a lot of data (not sure if our dataset is big enough), and is not well suited to data with many sparse features (which may be our case, because we have plenty of dummy variables for month, job and education).
model_random_forest = function (data_splitted, n_test, n_model, seed) {
if(missing(seed)) {
seed=100
}
training <- data_splitted[[1]]
test <- data_splitted[[2]]
validation <- data_splitted[[3]]
#the optimal mtry turns out to be 2, so you can change this line if you want to run it faster
mtry = c(1:(ncol(training)-1))
Fscores=c()
for (i in mtry){
set.seed(seed)
model_forest <- randomForest(as.factor(y)~.,data=training, ntree=n_test, mtry=mtry[i] )
set.seed(seed)
outcome_y <- predict(model_forest, newdata=test, type="response")
tab <- table(outcome_y ,test$y)
Fscores[i] <- ifelse (nrow(tab)==2, tab[2,2]/(tab[2,2]+(tab[1,2]+tab[2,1])/2),0 )
}
optimal_mtry <- which.max(Fscores)[1]
model_forest <- randomForest(as.factor(y)~.,data=training, ntree=n_model, mtry=optimal_mtry )
outcome_y <- predict(model_forest, newdata=validation, type="response")
conf_matrix <- table(outcome_y ,validation$y)
Fscore <- function(x){ ifelse (nrow(x)==2, x[2,2]/(x[2,2]+(x[1,2]+x[2,1])/2), 0) }
print(optimal_mtry)
#we will also plot the relation between mtry hyperparameter and the Fscores
Fscores <- Fscores[!is.na(Fscores)]
plot(mtry,Fscores)
print(Fscore(conf_matrix))
print(conf_matrix)
outcome_test = predict(model_forest, newdata=test, type="prob")[,2]
outcome_validation = predict(model_forest, newdata=validation, type="prob")[,2]
score.or.class = gain = lift = roc = auc = prediction.object = list()
score.or.class[[1]] = list(validation$y, validation$y)
score.or.class[[2]] = list(outcome_test,test$y)
score.or.class[[3]] = list(outcome_validation, validation$y)
class.average = mean(validation$y)
random.class = 1
for (i in 1:(nrow(validation) - 1)) {
random.class = c(random.class, mean(random.class) < class.average)
}
score.or.class[[4]] = list(seq(0, 1, len = nrow(validation)), random.class)
for (i in 1:length(score.or.class)) {
prediction.object[[i]] = prediction(score.or.class[[i]][[1]],
score.or.class[[i]][[2]])
gain[[i]] = performance(prediction.object[[i]], "tpr", "rpp")
lift[[i]] = performance(prediction.object[[i]], "lift", "rpp")
roc[[i]] = performance(prediction.object[[i]], "tpr", "fpr")
auc[[i]] = performance(prediction.object[[i]], "auc")
}
LEGEND_LABELS = c("wizard", "test", "validation", "random")
#Creating plots
ShowCurve = function(list, name, AUC = FALSE, legend.position = "right") {
for (i in 1:length(list)) {
plot(list[[i]], main = paste(name, " curve"),
col = gray((i - 1) / 4), lwd = 2, add = (i != 1), xlim = c(0, 1))
if (AUC) {
text(.2, 0.9 - i * 0.1, pos = 4, col = gray((i - 1) / 4), cex = .9,
paste("AUC =", round(auc[[i]]@y.values[[1]], digit = 2)))
}
}
legend(legend.position, lty = 2, lwd = 2, col = gray(0:3 / 4),
y.intersp = .6, legend = LEGEND_LABELS, seg.len = 0.6, bty = "n")
}
par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))
ShowCurve(gain, "Gain")
ShowCurve(lift, "Lift", legend.position = "topright")
ShowCurve(roc, "ROC", AUC = TRUE)
return(model_forest)
}
start.time <- Sys.time()
#the more trees the better the outcome, but we will use just 50 trees while searching for the optimal mtry and 2000 trees to build the final model with that mtry
model_random_forest_outcome <- model_random_forest(data_split(data,0.6,0.2,0.2),50,2000)
## [1] 2
## [1] 0.339895
##
## outcome_y 0 1
## 0 4251 307
## 1 618 444
end.time <- Sys.time()
end.time - start.time
## Time difference of 5.933063 mins
As we can see the optimal mtry is 2, which means that only two candidate variables are considered at each split, so the individual trees are highly randomized. We should also note that the F score is very low compared to the previous models, while the AUC is almost the highest one. We can also look at the mean decrease in Gini index for the variables:
model_random_forest_outcome$importance
## MeanDecreaseGini
## age 295.75665
## job_entrepreneur 24.00365
## job_retired 26.75153
## job_services 42.25246
## job_student 43.20670
## job_technician 40.14391
## job_unemployed 23.38073
## marital_married 48.80311
## marital_single 47.99782
## education_high.school 56.62647
## month_3_7_9_12 301.41862
## month_apr 63.69615
## month_aug 38.19948
## month_jul 43.07444
## month_may 109.94416
## month_nov 51.19064
## day_of_week_mon 62.25687
## day_of_week_thu 55.07396
## campaign 212.30970
## cons.conf.idx 499.26585
## nr.employed 926.30016
## default_unknown 153.91321
## contact_by_tel 283.06969
## last_campaign_contacted 390.03421
## exp_previous 214.06983
## age_transformed 284.23490
## univ_or_admin 67.86893
## basic_or_bluecoll 61.62568
First of all we should note that these indexes are (almost) always lower for the dummies. The reason is simple: a dummy variable can be split only once. As we can see, nr.employed is the most important variable (by a lot), so we should have expected the optimal mtry to be very low. This table also tells us that we could have removed far more variables without losing too much predictive power. A drawback of this model is that there is essentially only one hyperparameter to optimize (we could also limit the maximal depth of the trees, but in our case overfitting is not a problem).
Now we can run the tuned algorithms on the same data split, collect the outcomes and compare them; you can change the seeds if you want to:
data_valuation <- data_split(data,0.6,0.2,0.2)
training<- data_valuation[[1]]
test <- data_valuation[[2]]
validation <- data_valuation[[3]]
#going for the logistic
set.seed(100)
model_reg = glm(y ~ + (.)*(+campaign+cons.conf.idx+nr.employed) , data = training, family = binomial)
outcome_reg <- predict(model_reg, newdata = validation, type = "response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
#we know the optimal cut
outcome_y_reg <- ifelse(outcome_reg > 0.65, 1, 0)
conf_matrix_reg <- table(outcome_y_reg, validation$y )
#for the knn
set.seed(100)
data_valuation_knn <- data_split(data_standardized,0.6,0.2,0.2)
training_knn<- data_valuation_knn[[1]]
test_knn <- data_valuation_knn[[2]]
validation_knn <- data_valuation_knn[[3]]
target_training <- training_knn$y
validation_y <- validation_knn$y
training_knn$y <- NULL
test_knn$y <- NULL
validation_knn$y <- NULL
outcome_y_knn <- knn(training_knn ,validation_knn, cl=target_training, k=481)
outcome_knn <- knn(training_knn ,validation_knn, cl=target_training, k=481, prob=TRUE)
outcome_knn <- attr(outcome_knn, "prob")
conf_matrix_knn <- table(outcome_y_knn ,validation_y)
#random forest model
outcome_y_forest <- predict(model_random_forest_outcome, newdata=validation, type="response")
outcome_forest = predict(model_random_forest_outcome, newdata=validation, type="prob")[,2]
conf_matrix_forest <- table(outcome_y_forest ,validation_y)
We can now make a table with statistics:
fourfoldplot( conf_matrix_reg)
fourfoldplot( conf_matrix_knn)
fourfoldplot( conf_matrix_forest)
confusionMatrix(conf_matrix_reg)
## Confusion Matrix and Statistics
##
##
## outcome_y_reg 0 1
## 0 4419 346
## 1 450 405
##
## Accuracy : 0.8584
## 95% CI : (0.849, 0.8674)
## No Information Rate : 0.8664
## P-Value [Acc > NIR] : 0.9619402
##
## Kappa : 0.4221
##
## Mcnemar's Test P-Value : 0.0002615
##
## Sensitivity : 0.9076
## Specificity : 0.5393
## Pos Pred Value : 0.9274
## Neg Pred Value : 0.4737
## Prevalence : 0.8664
## Detection Rate : 0.7863
## Detection Prevalence : 0.8479
## Balanced Accuracy : 0.7234
##
## 'Positive' Class : 0
##
confusionMatrix(conf_matrix_knn)
## Confusion Matrix and Statistics
##
## validation_y
## outcome_y_knn 0 1
## 0 4030 294
## 1 839 457
##
## Accuracy : 0.7984
## 95% CI : (0.7877, 0.8088)
## No Information Rate : 0.8664
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3338
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8277
## Specificity : 0.6085
## Pos Pred Value : 0.9320
## Neg Pred Value : 0.3526
## Prevalence : 0.8664
## Detection Rate : 0.7171
## Detection Prevalence : 0.7694
## Balanced Accuracy : 0.7181
##
## 'Positive' Class : 0
##
confusionMatrix(conf_matrix_forest)
## Confusion Matrix and Statistics
##
## validation_y
## outcome_y_forest 0 1
## 0 4251 307
## 1 618 444
##
## Accuracy : 0.8354
## 95% CI : (0.8255, 0.845)
## No Information Rate : 0.8664
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3951
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8731
## Specificity : 0.5912
## Pos Pred Value : 0.9326
## Neg Pred Value : 0.4181
## Prevalence : 0.8664
## Detection Rate : 0.7564
## Detection Prevalence : 0.8110
## Balanced Accuracy : 0.7321
##
## 'Positive' Class : 0
##
For the regression the F-score was around 0.5, for the KNN around 0.45, and 0.34 for the random forest. As we can see all the accuracies are high, but the one from the regression is the highest (same goes for the F score). We should also note that the logistic regression model was the worst at predicting the 1 class (though not by a lot); here the KNN was the best. The best thing about the regression model is that it predicts both classes pretty well, while the KNN has a big issue with predicting 0s. We can also look at the ROC curves:
#LEGEND_LABELS = c("regression", "random forest", "knn")
pred_reg <- prediction(outcome_reg, validation$y)
perf_reg <- performance(pred_reg,"tpr","fpr")
pred_forest <- prediction(outcome_forest, validation$y)
perf_forest <- performance(pred_forest,"tpr","fpr")
pred_knn <- prediction(outcome_knn, validation_y)
perf_knn <- performance(pred_knn,"tpr","fpr")
plot( perf_reg, col="red", main="ROC curves on validation" )
par(new=TRUE)
plot( perf_knn, col="blue")
par(new=TRUE)
plot(perf_forest, col="green")
legend(x = "bottomright", legend = c("regression AUC 0.79", "random forest AUC 0.77", "knn AUC 0.59"), col = c(2, 3, 4),lwd = 2)
As we can see, the KNN ROC curve seems to lie below the curves from the regression and the random forest at almost every point, and it also has a very low AUC. The logistic regression curve is almost always slightly above the random forest curve, so by this graph and by AUC the regression is the best model.
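The AUC values quoted in the legend can be extracted from the same ROCR `prediction` objects used for the curves; a small self-contained illustration (with hypothetical toy scores and labels), followed by how it would be applied to the models above:

```r
library(ROCR)
# Toy example: two positives scored above two negatives -> perfect ranking.
toy_pred <- prediction(c(0.9, 0.8, 0.4, 0.2), c(1, 1, 0, 0))
toy_auc  <- as.numeric(performance(toy_pred, "auc")@y.values)
toy_auc  # 1 for a perfect ranking
# Applied to the fitted models, e.g.:
# performance(pred_reg, "auc")@y.values[[1]]
```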
Overall, I think the expanded regression is the best model: it has the highest F-score and the highest AUC, and it predicts both classes well. It also computes extremely fast, and we can easily inspect the fitted parameters (so it is not a black box). As second best I would pick the KNN model, because it is very good at predicting the minority class and also has a pretty good F-score. The biggest drawbacks of the random forest are that it is mediocre at predicting both classes and that the computation takes quite a while.
I think the data preprocessing was too general and not tailored to any specific model, which may have worsened the results. For the regression we should have removed the collinearities, but that would likely have made the decision tree results worse. For KNN we should have looked for noise in the data, although as far as I know that is hard. The random forest needs little preprocessing; doing more of it may worsen the outcomes, though it can also lower the computation time (fewer trees are needed for an unbiased result). The good part is that overfitting was never a problem, so we had probably removed a large part of the noise in the data. We could also have included dummies for the year, but it is not in the data directly (there is no such variable, even though the observations are ordered by consecutive months, so probably also by consecutive years).
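The collinearity removal suggested above could be done with `caret::findCorrelation`, which flags columns from highly correlated pairs. A self-contained sketch on simulated data (the 0.9 cutoff is an arbitrary choice); applying it to the bank data's numeric columns would work the same way:

```r
library(caret)
set.seed(1)
# Two nearly identical predictors plus an independent one.
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.01)
x3 <- rnorm(100)
cor_mat <- cor(cbind(x1, x2, x3))
flagged <- findCorrelation(cor_mat, cutoff = 0.9, names = TRUE)
flagged  # one of the collinear pair is suggested for removal
# For the actual data set, roughly:
# findCorrelation(cor(data[sapply(data, is.numeric)]), cutoff = 0.9, names = TRUE)
```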
I think the biggest issue was deciding whether to remove or merge variables. My approach for merges was to look at the PCA plots of the first few components or to use k-means clustering, and for removals to look at both the PCA and the correlations with the target variable.
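The merge heuristic described above can be sketched as clustering the variables by their loadings on the first few principal components; variables landing in the same cluster are merge candidates. This uses `iris` as stand-in numeric data, not the bank data set:

```r
# Stand-in numeric matrix (the bank data's numeric columns would replace it).
num <- scale(iris[, 1:4])
pca <- prcomp(num)
loadings <- pca$rotation[, 1:2]       # loadings on the first two components
set.seed(42)
km <- kmeans(loadings, centers = 2)   # variables in the same cluster are
km$cluster                            # candidates for merging
```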
Overall, I think the results are good, but there is still a lot of room for improvement.